
Gradient Descent Animation: 1. Simple Linear Regression

This is the first part of a series of articles on how to create animated plots visualizing gradient descent.

The Gradient Descent method is one of the most widely used parameter optimization algorithms in machine learning today. Python's celluloid module enables us to create vivid animations of model parameters and costs during gradient descent.

In this article, I use simple linear regression as an example to visualize batch gradient descent. The goal is to build a linear regression model and train it on some made-up data points. For every training round ('epoch'), we store the current model parameters and costs. Finally, we create animations from the stored values.

Set up the model

There are various articles on how to set up and fit a linear regression model on Medium alone. In essence, we are trying to find the best-fitting straight line for our data. Mathematically, a straight line in two-dimensional space can be described by the function y = w*x + b, with w representing the slope (or "weight") and b representing the y-intercept (or "bias") of our line. There are numerous methods for determining the optimal values of w and b given our n data points. The gradient descent algorithm aims to minimize the mean squared error between the observed data points (y) and the points predicted by our regression line (ŷ): J = (1/n) * Σ(yᵢ − ŷᵢ)². The mean squared error is also referred to as the 'cost function' (or 'costs'), usually denoted J.
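As a minimal sketch (the data and function names here are made up, not taken from the article's notebook), the cost for a given w and b can be computed like this:

```python
import numpy as np

def predict(x, w, b):
    # straight line: y_hat = w*x + b
    return w * x + b

def cost(x, y, w, b):
    # mean squared error between observed y and predicted y_hat
    y_hat = predict(x, w, b)
    return np.mean((y - y_hat) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x - 3.0  # made-up, perfectly linear data

print(cost(x, y, 2.0, -3.0))  # exact parameters -> cost of 0.0
```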

With our data points fixed, the cost function depends only on the parameters w and b. We aim to adjust the parameters until the cost function reaches its minimum. To see in which direction to adjust (decrease vs. increase) each parameter, we introduce the gradient of our cost function:

∇J(w,b) = (∂J/∂w, ∂J/∂b)

with ∂J/∂w and ∂J/∂b being the partial derivatives of J with respect to w and b, respectively. By constantly moving our parameters in the opposite direction of the current gradient ∇J, we can stepwise reduce the costs J. The size of the steps we take toward the (local/global) minimum of J is usually denoted α and is also referred to as the 'learning rate'. When training our model, our objective is to repeat the following update for each epoch until we reach convergence:

w ← w − α * ∂J/∂w,  b ← b − α * ∂J/∂b
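A single epoch of this update could be sketched as follows (again with made-up data and names of my own choosing):

```python
import numpy as np

def gradient_step(x, y, w, b, lr):
    # predictions with the current parameters
    y_hat = w * x + b
    # partial derivatives of the mean squared error J
    dJ_dw = -2.0 * np.mean(x * (y - y_hat))
    dJ_db = -2.0 * np.mean(y - y_hat)
    # move against the gradient, scaled by the learning rate
    return w - lr * dJ_dw, b - lr * dJ_db

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x - 3.0
w, b = 3.0, -1.0                              # starting parameters
w, b = gradient_step(x, y, w, b, lr=0.001)    # one epoch of batch GD
```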

Metaphorically speaking, the cost function can be imagined as mountainous terrain where, beginning from a certain starting point, we want to head downhill until we reach a valley. Analogously, the gradient gives us the 'multidimensional slope', i.e. the direction in which 'uphill' lies on the mountain surface. That is why we constantly adjust our parameters in the opposite direction of the gradient.

Gradient descent algorithms can be subclassified according to how much of the training data is used simultaneously to compute the gradient of the cost function. In the following example, we use the entire dataset for every update, which is also referred to as batch gradient descent. In Python, we import some useful libraries and set up our simple linear regression model:
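The original code is embedded as a gist on Medium; a minimal reconstruction of the setup might look like this (function names are my own sketch; matplotlib and celluloid, which the later animations need, are omitted here):

```python
import numpy as np

def predict(x, w, b):
    # simple linear regression model: y_hat = w*x + b
    return w * x + b

def cost(x, y, w, b):
    # mean squared error J(w, b)
    return np.mean((y - predict(x, w, b)) ** 2)

def gradients(x, y, w, b):
    # partial derivatives of J with respect to w and b
    residuals = y - predict(x, w, b)
    return -2.0 * np.mean(x * residuals), -2.0 * np.mean(residuals)
```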

We then introduce our training data, define the learning rate (α = 0.001), initialize our starting parameters (w = 3, b = −1), and finally train our model. For every epoch, we store the updated parameter values, the costs, and some particular predicted y-values in lists. The list items are then converted to numpy arrays, where they serve as raw data for our animated plots.
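A self-contained sketch of such a training loop (the data points and variable names are my assumptions, not the article's):

```python
import numpy as np

# made-up training data (stand-in for the article's points)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-0.8, 1.1, 2.9, 5.2, 6.9])   # roughly y = 2x - 3

lr = 0.001          # learning rate alpha
w, b = 3.0, -1.0    # starting parameters
epochs = 12000

w_hist, b_hist, cost_hist, y_hat_hist = [], [], [], []
for _ in range(epochs):
    y_hat = w * x + b
    residuals = y - y_hat
    # store current parameters, costs, and predictions
    w_hist.append(w)
    b_hist.append(b)
    cost_hist.append(np.mean(residuals ** 2))
    y_hat_hist.append(y_hat)
    # batch gradient descent update
    w -= lr * (-2.0 * np.mean(x * residuals))
    b -= lr * (-2.0 * np.mean(residuals))

# lists -> numpy arrays: the raw data for the animated plots
w_hist, b_hist = np.array(w_hist), np.array(b_hist)
cost_hist, y_hat_hist = np.array(cost_hist), np.array(y_hat_hist)
```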

Note that the particularly small learning rate of 0.001 was chosen on purpose to prevent overly large steps during the first epochs of gradient descent. A larger learning rate (e.g. α = 0.1) usually results in faster model convergence, requiring fewer epochs. However, overly large steps during the first epochs tend to result in less appealing animations, or even in failure to converge. Just to make sure our fitted parameters converged to their true values, we verify our results with sklearn's built-in linear regression model.
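A self-contained sketch of such a check might look as follows (I use a larger learning rate than in the article, purely so this standalone snippet converges quickly; the data is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x - 3.0              # made-up, noise-free data

# batch gradient descent
w, b, lr = 3.0, -1.0, 0.01
for _ in range(20000):
    residuals = y - (w * x + b)
    w -= lr * (-2.0 * np.mean(x * residuals))
    b -= lr * (-2.0 * np.mean(residuals))

# sklearn's closed-form fit as the reference
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(w, model.coef_[0])        # both should be close to 2
print(b, model.intercept_)      # both should be close to -3
```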

Since we are now confident that our gradient descent algorithm worked out as planned, we can move on to the animations.

Animations of gradient descent (simple linear regression):

With our values generated and stored, we can now start to build some animations. We could, for example, plot the values that our cost function and parameters take on over the epochs, while simultaneously plotting the corresponding regression line:

These rather basic plots reveal a very important characteristic of gradient descent: if set up correctly, costs drop rapidly and parameter values change noticeably at the beginning of training. As the epochs increase, only minor changes to costs and parameter values can be observed. Therefore, plotting every value we initially stored seems unfavorable. By focusing predominantly on the first epochs of the fitting process, we can visualize most of the 'action' without crashing Python while generating these resource-intensive animations. After trying out different selections of points to plot, I decided to use the first 50 epochs of the fitting process continuously, followed by only every 5th and later every 200th data point, until the number of epochs reaches 12,000. In the following piece of code, we define the epochs we intend to incorporate into our plots and create the animation above by taking a snapshot after each pass through the for-loop. By calling Camera's animate function, we turn the snapshots into an animation.

In my opinion, it makes sense to return the final plotted values of J, w, and b, so that we can confirm the animation roughly captures model convergence despite not using all the points stored during the fitting process. Especially in 3D animations, it can sometimes be difficult to verify convergence just by looking at the graph.

A more intriguing way of visualizing gradient descent can be obtained by plotting the cost function with respect to the parameters w and b, since this is closer to the actual concept of J being a function of w and b. In addition, we introduce dashed connection lines between the regression line and our training data to portray the respective residuals.

Some minor changes to our previous piece of code get us the animation above:
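The dashed residual lines can be drawn with plain matplotlib; here is a sketch with made-up data and a fixed parameter pair standing in for one animation frame:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-0.8, 1.1, 2.9, 5.2, 6.9])
w, b = 2.0, -3.0                 # current parameters for this frame
y_hat = w * x + b

fig, ax = plt.subplots()
ax.scatter(x, y, color="tab:blue")               # training data
ax.plot(x, y_hat, color="tab:red")               # regression line
for xi, yi, yh in zip(x, y, y_hat):
    # dashed connection line portraying the residual
    ax.plot([xi, xi], [yi, yh], linestyle="--", color="gray")
```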

Referring to the aforementioned 'mountain' analogy, a 3D visualization of gradient descent seems desirable. However, this requires some preliminary work, since we have to create data points we never encountered during the fitting process. In other words, we need to compute the costs for every pair of w and b over a predefined range of parameter values to obtain a surface plot. Fortunately, numpy has a built-in function called meshgrid that enables us to create coordinate grids for our three-dimensional plots.

With the following code, we can now visualize gradient descent in three dimensions.
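A sketch of the surface computation (the parameter ranges, grid resolution, and data are my own choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x - 3.0

# coordinate grids over a predefined range of parameter values
w_grid, b_grid = np.meshgrid(np.linspace(-1, 5, 100), np.linspace(-6, 2, 100))

# costs for every (w, b) pair on the grid, computed via broadcasting
J_grid = np.mean(
    (y - (w_grid[..., np.newaxis] * x + b_grid[..., np.newaxis])) ** 2,
    axis=-1,
)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")    # 3d projection, matplotlib >= 3.2
ax.plot_surface(w_grid, b_grid, J_grid, cmap="viridis", alpha=0.7)
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("J(w, b)")
```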

I hope you enjoyed this article. If anything is unclear or if you have noticed any mistakes, please feel free to leave a comment. In the next article, I will address animations of gradient descent using the example of multiple linear regression. The complete notebook can be found on my GitHub. Thank you for your interest!
